@relaxis commented Oct 22, 2025

Summary

This PR includes critical bug fixes for WAN 2.2 I2V/T2V training and improvements for video training workflows.


Fixes Included

1. MoE Per-Expert LR Logging Fix ✨ NEW

Problem: The logged LR was averaged across all param groups for MoE models, making it impossible to verify per-expert LR adaptation and state preservation.

Solution:

  • Detect MoE via multiple param groups (BaseSDTrainProcess.py)
  • Display separate LR for each expert: lr0: 5.0e-04 lr1: 3.5e-05
  • Shows which expert is training and tracks independent LR adaptation

Files changed: jobs/process/BaseSDTrainProcess.py

Example output:

```
# Before (meaningless average)
lr: 2.7e-04 loss: 8.414e-02

# After (clear per-expert visibility)
lr0: 2.8e-05 lr1: 0.0e+00 loss: 8.414e-02  # High Noise active
lr0: 5.2e-05 lr1: 1.0e-05 loss: 7.821e-02  # Low Noise now active, High preserved
```
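The display logic can be sketched as follows. This is a minimal illustration, not the actual code in BaseSDTrainProcess.py; `param_groups` stands in for the optimizer's group list, and the formatting function name is hypothetical:

```python
def format_lr_display(param_groups):
    """Show one LR per param group when MoE is detected (>1 group);
    otherwise fall back to the single-LR display."""
    if len(param_groups) > 1:
        # MoE: one entry per expert, e.g. "lr0: 5.0e-04 lr1: 3.5e-05"
        return " ".join(
            f"lr{i}: {g['lr']:.1e}" for i, g in enumerate(param_groups)
        )
    return f"lr: {param_groups[0]['lr']:.1e}"

groups = [{"lr": 5.0e-4}, {"lr": 3.5e-5}]
print(format_lr_display(groups))  # lr0: 5.0e-04 lr1: 3.5e-05
```

With a single param group the old `lr:` format is preserved, so non-MoE models log exactly as before.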

2. MoE Transformer Detection Bug Fix ✨ NEW

Problem: _prepare_moe_optimizer_params() checked for .transformer_1. (with dots), but lora_name uses $$ separators, so the check never matched. All params fell into a single group instead of separate groups per expert.

Solution:

  • Fixed substring matching to use transformer_1 without dots
  • Now correctly matches names like transformer$$transformer_1$$blocks$$0$$attn1$$to_q
  • Creates proper separate param groups for transformer_1 and transformer_2
  • Enables per-expert lr_bump, min_lr, max_lr with automagic optimizer

Files changed: toolkit/lora_special.py
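The failure mode is easy to demonstrate. The routing function below is a sketch (its name and group numbering are illustrative), but the `$$`-separated name is the format the PR describes:

```python
def expert_group_for(lora_name: str) -> int:
    """Route a LoRA module to its expert's param group. Names use
    '$$' separators, so the old dotted '.transformer_1.' check
    could never match."""
    if "transformer_1" in lora_name:
        return 0  # high-noise expert
    if "transformer_2" in lora_name:
        return 1  # low-noise expert
    return 0

name = "transformer$$transformer_1$$blocks$$0$$attn1$$to_q"
assert ".transformer_1." not in name  # old check: never true
assert "transformer_1" in name        # fixed check: matches
```

Dropping the surrounding dots makes the substring check separator-agnostic, which is why the fix works for the `$$`-joined names.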

3. WAN 2.2 I2V Boundary Detection Fix

Problem: The toolkit was hardcoded to use T2V boundary ratio (0.875) for all WAN 2.2 models, causing incorrect timestep distribution for I2V models.

Solution:

  • Auto-detect I2V vs T2V models from model path
  • Use correct boundary ratio: 0.9 for I2V, 0.875 for T2V
  • Fixes dual LoRA (HIGH/LOW noise) training for I2V models

Files changed: extensions_built_in/diffusion_models/wan22/wan22_14b_model.py
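The detection amounts to a path check plus the two ratios. A minimal sketch, assuming a case-insensitive "i2v" substring test (the real logic lives in wan22_14b_model.py and may key off other metadata):

```python
def wan22_boundary_ratio(model_path: str) -> float:
    """Timestep boundary between HIGH- and LOW-noise experts:
    0.9 for I2V checkpoints, 0.875 for T2V."""
    if "i2v" in model_path.lower():
        return 0.9
    return 0.875

print(wan22_boundary_ratio("Wan2.2-I2V-A14B"))  # 0.9
print(wan22_boundary_ratio("Wan2.2-T2V-A14B"))  # 0.875
```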

4. AdamW8bit OOM Crash Fix

Problem: When an OOM occurs during a training step, the progress-bar update attempts to access loss_dict, which was never populated, causing a KeyError crash.

Solution:

  • Only update progress bar if training step succeeded (not did_oom)
  • Prevents crash and allows training to continue after OOM recovery

Files changed: jobs/process/BaseSDTrainProcess.py
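The shape of the guard, sketched with illustrative names (`update_progress`, a dict standing in for the tqdm postfix; the real code updates the actual progress bar):

```python
def update_progress(progress, loss_dict, did_oom):
    """Only read loss_dict when the step succeeded; after an OOM
    the dict is empty, and indexing it is what raised the KeyError."""
    if did_oom:
        return progress  # skip the update, let training continue
    progress["postfix"] = f"loss: {loss_dict['loss']:.3e}"
    return progress

print(update_progress({}, {"loss": 0.08414}, did_oom=False))
print(update_progress({}, {}, did_oom=True))  # no KeyError
```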

5. Gradient Norm Logging

Problem: No visibility into gradient norms during training, making it difficult to diagnose divergence and LR issues.

Solution:

  • Added _calculate_grad_norm() method with comprehensive gradient tracking
  • Handles sparse gradients and param groups correctly
  • Logs grad_norm in loss_dict alongside loss
  • Essential for monitoring training stability with adaptive optimizers

Files changed: extensions_built_in/sd_trainer/SDTrainer.py
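The core of such a method is a global L2 norm over all param groups, skipping params with no gradient (frozen, or belonging to the inactive expert). A pure-Python sketch, with gradients as plain float lists rather than the tensors the real trainer uses:

```python
import math

def calculate_grad_norm(param_groups):
    """Global L2 gradient norm: sqrt of the summed squared gradient
    entries across every param in every group. Params whose grad is
    None are skipped."""
    total_sq = 0.0
    for group in param_groups:
        for p in group["params"]:
            grad = p.get("grad")
            if grad is None:
                continue
            total_sq += sum(g * g for g in grad)
    return math.sqrt(total_sq)

groups = [{"params": [{"grad": [3.0, 4.0]}, {"grad": None}]}]
print(calculate_grad_norm(groups))  # 5.0
```

A steadily growing value from a method like this is an early divergence signal, which is what makes it useful alongside adaptive optimizers.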


Features Included

1. Video-Friendly Bucket Resolutions ✨ NEW

Problem: Previous SDXL-oriented buckets caused excessive cropping for video content with common aspect ratios.

Solution:

  • New resolutions_video_1024 with video aspect ratios (16:9, 9:16, 4:3, 3:4)
  • Uses primary buckets only, so frames are never assigned to undersized buckets
  • Enabled by default with use_video_buckets: true

Benefits:

  • Better aspect ratio preservation
  • Reduced unnecessary cropping
  • Improved training quality for video datasets
  • Backwards compatible (can disable with use_video_buckets: false)

Files changed: toolkit/buckets.py, toolkit/data_loader.py, toolkit/dataloader_mixins.py, toolkit/config_modules.py
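Bucket assignment by nearest aspect ratio can be sketched as below. The exact resolutions_video_1024 list lives in toolkit/buckets.py; the 64-aligned dimensions here are illustrative examples for the named ratios:

```python
# Illustrative video-friendly buckets (width, height)
RESOLUTIONS_VIDEO_1024 = [
    (1024, 576),   # 16:9
    (576, 1024),   # 9:16
    (1024, 768),   # 4:3
    (768, 1024),   # 3:4
    (1024, 1024),  # 1:1
]

def nearest_bucket(width, height, buckets=RESOLUTIONS_VIDEO_1024):
    """Pick the bucket whose aspect ratio is closest to the frame's,
    minimizing the crop needed to fit."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))

print(nearest_bucket(1920, 1080))  # (1024, 576)
```

A 1920x1080 clip lands in the 16:9 bucket with zero cropping, where an SDXL-oriented bucket list would force a near-square crop.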

2. Pixel Budget Scaling ✨ NEW

Problem: Different aspect ratios used inconsistent resolutions, causing variable memory usage and suboptimal quality.

Solution:

  • New max_pixels_per_frame parameter for memory-based scaling
  • Each aspect ratio is maximized within the pixel budget
  • Example: max_pixels_per_frame: 589824 (768×768) optimally scales all ratios

Benefits:

  • Consistent memory usage across aspect ratios
  • Maximizes resolution for each ratio within memory constraints
  • Better quality without memory surprises
  • Only activates when max_pixels_per_frame is set
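The scaling itself is a uniform resize to the budget, preserving aspect ratio. A minimal sketch; the 16-pixel alignment is an assumption (the real toolkit's latent stride may differ):

```python
import math

def scale_to_pixel_budget(width, height, max_pixels_per_frame, align=16):
    """Scale (width, height) to fill the pixel budget while keeping
    aspect ratio; dims are snapped to a multiple of `align`."""
    scale = math.sqrt(max_pixels_per_frame / (width * height))
    w = int(round(width * scale)) // align * align
    h = int(round(height * scale)) // align * align
    return w, h

# 589824 = 768 * 768; a 16:9 source fills the same budget at 1024x576
print(scale_to_pixel_budget(1920, 1080, 589824))  # (1024, 576)
```

This is why one budget value gives consistent memory use: 1024x576, 768x768, and 576x1024 all occupy roughly the same number of pixels per frame.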

Feature Requests

UI/Config Enhancements

1. Automagic Optimizer Support

Request UI fields and validation for the automagic optimizer:

  • min_lr, max_lr, lr_bump, starting lr

Benefit: Automagic is highly effective for WAN 2.2 training but currently requires manual YAML editing.

2. Network Dropout Settings

Add UI field for network.dropout parameter.

Benefit: Dropout helps prevent overfitting in LoRA training, especially important for small datasets.

3. More Custom Resolutions

Add more resolution presets: 256x256, 320x320, 384x384, 448x448, 512x512

Benefit: Different resolutions have different training characteristics.

4. Training Metrics & Graph Plotting

Add built-in metric tracking and visualization:

  • Per-LoRA loss tracking
  • Gradient norm over time
  • Learning rate progression
  • Optional TensorBoard export

Benefit: Currently users must manually parse logs and create graphs.

VRAM Optimization Requests

5. Single LoRA Training Mode for WAN 2.2

Add options to load only HIGH or only LOW noise model.

Benefit: Saves ~7-10GB VRAM by not loading the unused transformer.

6. Fix RAMTorch Implementation for WAN 2.2

RAMTorch currently does not work properly with the WAN 2.2 dual-transformer architecture.

Benefit: Would enable training on lower VRAM GPUs.

7. PyTorch Nightly + CUDA 13 Support (Blackwell)

Add optional requirements for PyTorch nightly, CUDA 13.x, SM_120.

Benefit: Enables RTX 50-series GPU users to utilize new optimizations.


Testing

All fixes and features have been tested in production WAN 2.2 I2V LoRA training:

  • 59-video dataset with mixed aspect ratios
  • 6000+ steps
  • Automagic optimizer with per-expert parameters
  • Dual LoRA (HIGH/LOW noise) training
  • MoE switching every 100 steps

Results:

  • ✅ Per-expert LR display working correctly
  • ✅ LR state preservation verified at each expert switch
  • ✅ Video buckets properly preserve aspect ratios
  • ✅ Pixel budget scaling maintains consistent memory usage
  • ✅ No OOM crashes
  • ✅ Gradient norm logging provides excellent training visibility

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
